ML Model Deployment

Why Deployment: ML in production vs. ML in research

These concerns matter more for ML in production than for ML in research:

Workflow for Deploying ML Models to Production

Model Serving

Serving = making your model available for real users to query through an API or endpoint.

Key components

Performance metrics

Balancing metrics

Model Serving Tooling

Web/API Frameworks for Model Serving

What is an API

- An API lets two pieces of software talk to each other, typically in a client/server architectural style. It allows a program to access data or trigger an action programmatically.
- In ML, clients send input → the server returns a prediction.
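This request/response exchange can be sketched with plain JSON payloads; the feature names and the threshold-based "model" below are hypothetical stand-ins for a real serialized model.

```python
import json

# Hypothetical request a client might send to an ML inference API.
request_body = {"features": {"age": 42, "income": 55000.0}}

def handle_request(raw: str) -> str:
    """Server side: deserialize JSON, run the model, serialize a prediction."""
    payload = json.loads(raw)
    # Stand-in for real model inference.
    score = 0.9 if payload["features"]["income"] > 50000 else 0.2
    return json.dumps({"prediction": score})

# Client side: send serialized input, parse the serialized prediction.
response = json.loads(handle_request(json.dumps(request_body)))
print(response["prediction"])  # 0.9
```

The key point is the contract: both sides agree on the JSON shapes, so the client never needs to know anything about the model's internals.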

REST API

REST = Representational State Transfer

API Frameworks for Serving Models

| Framework | Notes |
| --- | --- |
| FastAPI | Best for high-performance async inference |
| Flask | Simple and flexible |
| Django | Full web framework; includes ORM and auth |
| gRPC (non-REST) | Faster binary communication; used in high-load inference |
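To keep the sketch framework-agnostic and self-contained, here is a REST-style `/predict` endpoint built on only Python's standard library; FastAPI or Flask would express the same idea more concisely with route decorators. The `predict` function and its `score` feature are hypothetical stand-ins.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features: dict) -> dict:
    # Stand-in for real model inference.
    return {"label": "positive" if features.get("score", 0) > 0.5 else "negative"}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Serve on an ephemeral port in a background thread.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Act as the client: POST a feature payload, read back the prediction.
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"score": 0.9}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
```

Whatever framework you pick, the serving contract is the same: accept features over HTTP, return a prediction as JSON.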

Specialized Serving Platforms

ML Deployment Approaches

Prediction Modes

Batch Prediction vs. Online Prediction

| | Batch Prediction (asynchronous) | Online/Real-Time Prediction (synchronous) |
| --- | --- | --- |
| Trigger | Scheduled / periodic | Prediction per request |
| Best for | Processing accumulated data when you don't need immediate results: recommender refresh, bulk scoring, churn prediction | When predictions are needed as soon as a data sample is generated: fraud detection, chatbots, personalization |
| Optimized for | High throughput | Low latency |
| Cost model | Pay during batch jobs | Pay while the endpoint is active |
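The distinction above is mostly about *when* inference runs, not *how*; a minimal sketch, with a hypothetical threshold model standing in for a real one:

```python
from typing import Iterable, List

def model(x: float) -> float:
    # Stand-in model: a simple threshold score.
    return 1.0 if x > 0.5 else 0.0

def predict_online(sample: float) -> float:
    """Online: score one sample the moment it arrives (optimize for latency)."""
    return model(sample)

def predict_batch(samples: Iterable[float]) -> List[float]:
    """Batch: score accumulated samples in one scheduled pass (optimize for throughput)."""
    return [model(s) for s in samples]

fraud_score = predict_online(0.9)              # per-request, e.g. fraud check
nightly_scores = predict_batch([0.1, 0.7, 0.4])  # scheduled, e.g. churn scoring
```

In practice the same model artifact typically backs both modes; only the trigger and the surrounding infrastructure (endpoint vs. scheduled job) differ.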

Unifying Batch Pipeline and Streaming Pipeline

Historically, the two pipelines were maintained separately by two teams; nowadays they can be unified:

Accelerating ML Model Inference

There are three main approaches to reducing a model's inference latency:

Model Optimization (compute faster)

Model Compression (smaller weights)
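One standard compression technique is quantization: storing weights as small integers plus a scale factor instead of full-precision floats. A minimal per-tensor int8 sketch (the weight values are made up for illustration):

```python
from typing import List, Tuple

# Hypothetical weight vector; a real model has millions of parameters.
weights = [0.12, -0.98, 0.45, 0.03, -0.31]

def quantize_int8(ws: List[float]) -> Tuple[List[int], float]:
    """Map floats to the int8 range [-127, 127] using one per-tensor scale."""
    scale = max(abs(w) for w in ws) / 127
    return [round(w / scale) for w in ws], scale

def dequantize(qs: List[int], scale: float) -> List[float]:
    """Recover approximate floats: each int is off by at most scale / 2."""
    return [q * scale for q in qs]

q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
max_err = max(abs(a - w) for a, w in zip(approx, weights))
```

Each weight now fits in 1 byte instead of 4 (float32), a 4x size reduction, at the cost of a small rounding error bounded by half the scale; production schemes (per-channel scales, quantization-aware training) refine this same idea.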

Hardware Strategies (faster hardware)

Make the hardware the model is deployed on run faster.